[feature/semantic_text] Register semantic text sub fields in the mapping #106560

jimczi · 2024-03-20T15:30:38Z

This PR refactors the semantic text field mapper to register its sub fields in the mapping instead of re-creating them each time a document is parsed.
It also fixes the generation of these fields when the semantic text field is defined inside an object field.
Lastly, this change adds a new section called model_settings to the field parameters, which the field mapper updates when inference results are received from a bulk action. The model settings become available in the field as soon as the first document with the inference field is ingested. They are then used to validate subsequent updates, ensuring consistency between what is used in the bulk action and what is defined in the field mapping.
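
To illustrate the consistency check described above, here is a minimal sketch in plain Java. It is not the code from this PR: the `ModelSettings` record, its three fields, and the `merge` helper are hypothetical stand-ins for whatever is stored under model_settings, and the sketch assumes that a mismatch is rejected with an exception.

```java
import java.util.Objects;

final class ModelSettingsCheck {
    // Hypothetical stand-in for the settings kept under model_settings in the mapping.
    record ModelSettings(String taskType, String inferenceId, Integer dimensions) {}

    // Returns the settings the mapping should hold after seeing `incoming`.
    // `mapped` is null until the first document with inference results is ingested.
    static ModelSettings merge(ModelSettings mapped, ModelSettings incoming) {
        Objects.requireNonNull(incoming, "incoming model settings");
        if (mapped == null) {
            // First inference results: record the settings in the mapping.
            return incoming;
        }
        if (!mapped.equals(incoming)) {
            // Later results must agree with what the mapping already holds.
            throw new IllegalArgumentException(
                "Incompatible model settings: mapping has " + mapped + " but bulk action used " + incoming
            );
        }
        return mapped;
    }

    public static void main(String[] args) {
        ModelSettings first = new ModelSettings("text_embedding", "my-model", 384);
        ModelSettings stored = merge(null, first);
        System.out.println("stored: " + merge(stored, new ModelSettings("text_embedding", "my-model", 384)));
        try {
            merge(stored, new ModelSettings("sparse_embedding", "my-model", null));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The sketch treats the first ingested document as the source of truth and compares every later update against it, which is the behaviour the description outlines.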

Note: This PR is opened against the feature branch: feature/semantic_text

carlosdelest

LGTM, but I'd like to check if we can avoid serializing inference_id on model settings for the field mapper, to avoid duplicating that information in the mapping.

...nference/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapper.java

...nce/src/test/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapperTests.java

Mikep86

Left some comments, none that I think block merging though

...nce/src/main/java/org/elasticsearch/xpack/inference/mapper/InferenceMetadataFieldMapper.java

Mikep86 · 2024-03-20T19:46:18Z

...nce/src/main/java/org/elasticsearch/xpack/inference/mapper/InferenceMetadataFieldMapper.java

+        for (XContentParser.Token token = parser.nextToken(); token != XContentParser.Token.END_OBJECT; token = parser.nextToken()) {
+            switch (parser.currentName()) {
+                case RESULTS -> parseResultsList(xContentLocation, parser, context, nestedMapper);
+                default -> throw new DocumentParsingException(xContentLocation, "Unknown field name " + parser.currentName());


We originally skipped unknown field names to be forward-compatible with any new fields added in the future. Any concerns about that?

This PR preserves this behaviour, but only for the _inference.$field_name.results object. I am not a fan of this leniency, but it is needed since we'll want to extend the information we provide in the future. I thought we could be strict on the top-level fields here since they should not change.
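
The split between strict and lenient parsing discussed here can be sketched roughly as follows. This is only an illustration over plain `Map` and `List` values, not the `XContentParser`-based code in the diff: the class name, the `text` key, and both helper methods are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

final class InferenceResultParsing {
    static final String RESULTS = "results";
    static final String TEXT = "text"; // hypothetical key inside each result entry

    // Strict: any top-level key other than "results" is rejected.
    @SuppressWarnings("unchecked")
    static int parseField(Map<String, Object> fieldObject) {
        int parsed = 0;
        for (Map.Entry<String, Object> entry : fieldObject.entrySet()) {
            switch (entry.getKey()) {
                case RESULTS -> parsed += parseResults((List<Map<String, Object>>) entry.getValue());
                default -> throw new IllegalArgumentException("Unknown field name " + entry.getKey());
            }
        }
        return parsed;
    }

    // Lenient: unknown keys inside a result entry are skipped, so new keys can be added later.
    static int parseResults(List<Map<String, Object>> results) {
        int parsed = 0;
        for (Map<String, Object> result : results) {
            for (Map.Entry<String, Object> entry : result.entrySet()) {
                if (TEXT.equals(entry.getKey())) {
                    parsed++;
                }
                // Any other key is ignored here.
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, Object> result = Map.of(TEXT, "some chunk", "future_key", 1);
        List<Map<String, Object>> results = List.of(result);
        Map<String, Object> field = Map.of(RESULTS, results);
        System.out.println("parsed entries: " + parseField(field));
    }
}
```

Being strict at the top level surfaces malformed input early, while the lenient inner loop leaves room for the extensions mentioned in the reply.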

...nce/src/main/java/org/elasticsearch/xpack/inference/mapper/InferenceMetadataFieldMapper.java

Mikep86 · 2024-03-20T20:13:08Z

...nference/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapper.java

+        MappedFieldType mappedFieldType,
+        CopyTo copyTo,
+        IndexVersion indexVersionCreated,
+        IndexAnalyzers indexAnalyzers,


It seems a bit silly that we're defining the index analyzers, given that we're not indexing or storing the text field.

The point of this text field is to store the text that generated the chunk in _source, correct? What if we made it a keyword field instead?

Yep good call, I wonder if we need to explicitly map this field though. Maybe we can just avoid creating it in the mapping entirely?

@carlosdelest WDYT?

I don't think we want to index or analyze it in any way; users can use multifields / copy_to in case they want to search for it. I'd vote for skipping the mapping.

I switched to a keyword field instead so that it's clear that the field is there but not indexed.

...erence/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextModelSettings.java

carlosdelest · 2024-03-21T08:07:44Z

...ce/src/yamlRestTest/resources/rest-api-spec/test/inference/20_semantic_text_field_mapper.yml

@@ -55,25 +56,7 @@ setup:
          index: test-index
          id: doc_1
          body:
-            non_inference_field: "you know, for testing"


These tests were meant to address the parsing of the inference results. I don't see this explicitly tested in other tests. Should we keep these?

I moved these tests to InferenceMetadataFieldMapperTests since it is not possible anymore to provide the _inference field manually in a bulk request.

carlosdelest · 2024-03-21T08:09:39Z

...nference/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapper.java

 */
 public class SemanticTextFieldMapper extends FieldMapper {
+    private static final Logger logger = LogManager.getLogger(SemanticTextFieldMapper.class);


We're not using the logger anymore I think

...nce/src/main/java/org/elasticsearch/xpack/inference/mapper/InferenceMetadataFieldMapper.java

...nce/src/test/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapperTests.java

Mikep86 · 2024-03-21T13:04:42Z

...erence/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextModelSettings.java

+                        + TASK_TYPE_FIELD.getPreferredName()
+                        + "], expected "
+                        + TEXT_EMBEDDING
+                        + "or "


String formatting error: There's no space between "TEXT_EMBEDDING" and "or"
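
A small sketch of the problem, assuming the concatenation continues with a second task type after "or ". The constant values, the message prefix, and the `SPARSE_EMBEDDING` operand are hypothetical, since the diff hunk is truncated.

```java
final class TaskTypeMessage {
    // Hypothetical values; the real constants live in the inference plugin.
    static final String TEXT_EMBEDDING = "text_embedding";
    static final String SPARSE_EMBEDDING = "sparse_embedding"; // assumed second operand

    // Mirrors the reviewed concatenation: no space before "or".
    static String buggy(String field) {
        return "Wrong [" + field + "], expected " + TEXT_EMBEDDING + "or " + SPARSE_EMBEDDING;
    }

    // Same message with the missing space added.
    static String fixed(String field) {
        return "Wrong [" + field + "], expected " + TEXT_EMBEDDING + " or " + SPARSE_EMBEDDING;
    }

    public static void main(String[] args) {
        System.out.println(buggy("task_type"));
        System.out.println(fixed("task_type"));
    }
}
```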

jimczi and others added 9 commits March 14, 2024 01:52

Revert "Extract interface from ModelRegistry so it can be used from s…

7aaa3b6

…erver (elastic#105012)" This reverts commit f4d3ab9.

inference as an action filter

ebc26d2

add more tests

86ddc9d

Merge branch 'feature/semantic-text' into carlosdelest/semantic-text-…

dd73d01

…copy-to-support-inference # Conflicts: # server/src/main/java/org/elasticsearch/action/bulk/BulkShardRequestInferenceProvider.java # server/src/test/java/org/elasticsearch/action/bulk/BulkOperationTests.java

Merge from feature branch

c5de0da

Refactor the semantic_text field so that it can registers all the sub…

b2b8635

…-fields in the mapping

Merge branch 'bulk_inference_ref' into register_semantic_text

64e8e43

Refatcor the semantic_text to register its sub fields in the mapping …

1c18fbc

…instead of re-creating them each time.

Merge remote-tracking branch 'upstream/feature/semantic-text' into re…

38f82fd

…gister_semantic_text

jimczi requested review from carlosdelest and Mikep86 March 20, 2024 15:30

elasticsearchmachine added the needs:triage label Mar 20, 2024

add task_type validation

7b578d1

carlosdelest approved these changes Mar 20, 2024

Mikep86 approved these changes Mar 20, 2024

jimczi added 2 commits March 20, 2024 22:46

address review comments

2be50d7

remove unused

eb4731f

carlosdelest reviewed Mar 21, 2024

address review comments

b3fb5d3

Mikep86 approved these changes Mar 21, 2024

jimczi added 3 commits March 21, 2024 13:22

Fix the mapper builder context when updating the semantic text field …

8ddc37f

…definition

string formatting error

b3ae284

results => chunks renaming

2e7fc7f

This was referenced Mar 22, 2024

[feature/semantic-text] semantic text copy to support #106689

Merged

copy_to and multifields support for semantic_text jimczi/elasticsearch#1

Closed

jimczi merged commit d4e283d into elastic:feature/semantic-text Mar 22, 2024
11 of 14 checks passed
